Abstract

Most existing word embedding methods can be categorized into Neural Embedding
Models and Matrix Factorization (MF)-based methods. However, some models are
opaque to probabilistic interpretation, and MF-based methods, typically solved
using Singular Value Decomposition (SVD), may incur loss of corpus information.
In addition, it is desirable to incorporate global latent factors, such as
topics, sentiments or writing styles, into the word embedding model. Since
generative models provide a principled way to incorporate latent factors, we
propose a generative word embedding model, which is easy to interpret, and can
serve as a basis of more sophisticated latent factor models. The model inference
reduces to a low rank weighted positive semidefinite approximation problem. Its
optimization is approached by eigendecomposition on a submatrix, followed by
online blockwise regression, which is scalable and avoids the information loss
in SVD. In experiments on 7 common benchmark datasets, our vectors are
competitive with word2vec, and better than other MF-based methods.

1 Introduction 

The task of word embedding is to model the distribution of a word and its
context words using their corresponding vectors in a Euclidean space. Then by
doing regression on the relevant statistics derived from a corpus, a set of
vectors is recovered which best fits these statistics. These vectors, commonly
referred to as the embeddings, capture semantic/syntactic regularities between
the words.

The core of a word embedding method is the link function that connects the
input (the embeddings) with the output (certain corpus statistics). Based on
the link function, the objective function is developed. The reasonableness of
the link function impacts the quality of the obtained embeddings, and different
link functions are amenable to different optimization algorithms, with different
scalability. Based on the forms of the link function and the optimization
techniques, most methods can be divided into two classes: the traditional neural
embedding models, and more recent low rank matrix factorization methods.


The neural embedding models use the softmax link function to model the
conditional distribution of a word given its context (or vice versa) as a
function of the embeddings. The normalizer in the softmax function brings
intricacy to the optimization, which is usually tackled by gradient-based
methods. The pioneering work was (Bengio et al., 2006). Later Mnih and Hinton
(2007) propose three different link functions. However, there are interaction
matrices between the embeddings in all these models, which complicate and slow
down the training, hindering them from being trained on huge corpora. Mikolov
et al. (2013a) and Mikolov et al. (2013b) greatly simplify the conditional
distribution, where the two embeddings interact directly. They implemented the
well-known "word2vec", which can be trained efficiently on huge corpora. The
obtained embeddings show excellent performance on various tasks.


Low-Rank Matrix Factorization (MF in short) methods include various link
functions and optimization methods. The link functions are usually not softmax
functions. MF methods aim to reconstruct a certain corpus statistics matrix by
the product of two low rank factor matrices. The objective is usually to
minimize the reconstruction error, optionally with other constraints. In this
line of research, Levy and Goldberg (2014) find that "word2vec" is essentially
doing stochastic weighted factorization of the word-context pointwise mutual
information (PMI) matrix. They then factorize this matrix directly as a new
method. Pennington et al. (2014) propose a bilinear regression function of the
conditional distribution, from which a weighted MF problem on the bigram
log-frequency matrix is formulated. Gradient Descent is used to find the
embeddings. Recently, based on the intuition that words can be organized in
semantic hierarchies, Yogatama et al. (2015) add hierarchical sparse
regularizers to the matrix reconstruction error. With similar techniques,
Faruqui et al. (2015) reconstruct a set of pretrained embeddings using sparse
vectors of greater dimensionality. Dhillon et al. (2013) apply Canonical
Correlation Analysis (CCA) to the word matrix and the context matrix, and use
the canonical correlation vectors between the two matrices as word embeddings.
Stratos et al. (2014) and Stratos et al. (2015) assume a Brown language model,
and prove that doing CCA on the bigram occurrences is equivalent to finding a
transformed solution of the language model. Arora et al. (2015) assume there is
a hidden discourse vector on a random walk, which determines the distribution
of the current word. The slowly evolving discourse vector puts a constraint on
the embeddings in a small text window. The maximum likelihood estimate of the
embeddings within this text window approximately reduces to a squared norm
objective.

There are two limitations in current word embedding methods. The first
limitation is that all MF-based methods map words and their context words to
two different sets of embeddings, and then employ Singular Value Decomposition
(SVD) to obtain a low rank approximation of the word-context matrix
$\boldsymbol{M}$. As SVD factorizes $\boldsymbol{M}$, some information in
$\boldsymbol{M}$ is lost, and the learned embeddings may not capture the most
significant regularities in $\boldsymbol{M}$. Appendix A gives a toy example on
which SVD does not work properly.

The second limitation is that a generative model for documents parameterized by
embeddings is absent in recent development. Although (Stratos et al., 2014;
Stratos et al., 2015; Arora et al., 2015) are based on generative processes,
the generative processes are only for deriving the local relationship between
embeddings within a small text window, leaving the likelihood of a document
undefined. In addition, the learning objectives of some models, e.g. (Mikolov
et al., 2013b, Eq. 1), even have no clear probabilistic interpretation. A
generative word embedding model for documents is not only easier to interpret
and analyze, but more importantly, provides a basis upon which document-level
global latent factors, such as document topics (Wallach, 2006), sentiments (Lin
and He, 2009), writing styles (Zhao et al., 2011b), can be incorporated in a
principled manner, to better model the text distribution and extract relevant
information.

Based on the above considerations, we propose to unify the embeddings of words
and context words. Our link function factorizes into three parts: the
interaction of two embeddings capturing linear correlations of two words, a
residual capturing nonlinear or noisy correlations, and the unigram priors. To
reduce overfitting, we put Gaussian priors on embeddings and residuals, and
apply Jelinek-Mercer Smoothing to bigrams. Furthermore, to model the
probability of a sequence of words, we assume that the contributions of more
than one context word approximately add up. Thereby a generative model of
documents is constructed, parameterized by embeddings and residuals. The
learning objective is to maximize the corpus likelihood, which reduces to a
weighted low-rank positive semidefinite (PSD) approximation problem of the PMI
matrix. A Block Coordinate Descent algorithm is adopted to find an approximate
solution. This algorithm is based on Eigendecomposition, which avoids
information loss in SVD, but brings challenges to scalability. We then exploit
the sparsity of the weight matrix and implement an efficient online blockwise
regression algorithm. On seven benchmark datasets covering similarity and
analogy tasks, our method achieves competitive and stable performance.

The source code of this method is provided at
https://github.com/askerlee/topicvec


2 Notations and Definitions 

Throughout the paper, we always use an uppercase bold letter such as
$\boldsymbol{S}, \boldsymbol{V}$ to denote a matrix or set, a lowercase bold
letter such as $\boldsymbol{v}_{s_i}$ to denote a vector, a normal uppercase
letter such as $N, W$ to denote a scalar constant, and a normal lowercase
letter such as $s_i, w_i$ to denote a scalar variable.

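Suppose a vocabulary $\boldsymbol{S} = \{s_1, \cdots, s_W\}$ consists of all
the words, where $W$ is the vocabulary size. We further suppose
$s_1, \cdots, s_W$ are sorted in descending order of frequency, i.e. $s_1$ is
the most frequent, and $s_W$ is the least frequent. A document $d_i$ is a
sequence of words $d_i = (w_{i1}, \cdots, w_{iL_i})$, $w_{ij} \in \boldsymbol{S}$.
A corpus is a collection of $M$ documents
$\boldsymbol{D} = \{d_1, \cdots, d_M\}$. In the vocabulary, each word $s_i$ is
mapped to a vector $\boldsymbol{v}_{s_i}$ in $N$-dimensional Euclidean space.

Name                      Description
------------------------  ----------------------------------------------------
$\boldsymbol{S}$          Vocabulary $\{s_1, \cdots, s_W\}$
$\boldsymbol{V}$          Embedding matrix $(\boldsymbol{v}_{s_1}, \cdots, \boldsymbol{v}_{s_W})$
$\boldsymbol{D}$          Corpus $\{d_1, \cdots, d_M\}$
$\boldsymbol{v}_{s_i}$    Embedding of word $s_i$
$a_{s_i s_j}$             Bigram residual for $s_i, s_j$
$\tilde{P}(s_i, s_j)$     Empirical probability of $s_i, s_j$ in the corpus
$\boldsymbol{u}$          Unigram probability vector $(P(s_1), \cdots, P(s_W))$
$\boldsymbol{A}$          Residual matrix $(a_{s_i s_j})$
$\boldsymbol{B}$          Conditional probability matrix $(P(s_j | s_i))$
$\boldsymbol{G}$          PMI matrix
$\boldsymbol{H}$          Bigram empirical probability matrix $(\tilde{P}(s_i, s_j))$

Table 1: Notation Table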

In a document, a sequence of words is referred to as a text window, denoted by
$w_i, \cdots, w_{i+l}$, or $w_i{:}w_{i+l}$ in shorthand. A text window of
chosen size $c$ before a word $w_i$ defines the context of $w_i$ as
$w_{i-c}, \cdots, w_{i-1}$. Here $w_i$ is referred to as the focus word. Each
context word $w_{i-j}$ and the focus word $w_i$ comprise a bigram
$w_{i-j}, w_i$.
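The context and bigram definitions above can be illustrated with a small counting routine. This is only a sketch: the function name `count_bigrams` and the toy document are hypothetical, and the context is simply truncated at the start of the document, which the text does not specify.

```python
from collections import Counter

def count_bigrams(tokens, c):
    """Count (context word, focus word) bigrams, using the c words before each focus word."""
    counts = Counter()
    for i, focus in enumerate(tokens):
        # The context of tokens[i] is tokens[i-c], ..., tokens[i-1], truncated at the start.
        for context in tokens[max(0, i - c):i]:
            counts[(context, focus)] += 1
    return counts

doc = ["the", "cat", "sat", "on", "the", "mat"]
bigrams = count_bigrams(doc, c=2)
```

With `c=2`, each focus word from the third position onward contributes two bigrams, one per context word.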

The Pointwise Mutual Information between two words $s_i, s_j$ is defined as

$$\mathrm{PMI}(s_i, s_j) = \log \frac{P(s_i, s_j)}{P(s_i) P(s_j)}.$$
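As an illustration, the PMI of every bigram can be computed from a matrix of bigram counts. The sketch below assumes the standard definition $\mathrm{PMI}(s_i, s_j) = \log \frac{P(s_i, s_j)}{P(s_i)P(s_j)}$ and, as a simplification of this example, estimates the unigram probabilities from the row and column marginals of the counts. The name `pmi_matrix` is hypothetical.

```python
import numpy as np

def pmi_matrix(counts):
    """PMI(s_i, s_j) = log P(s_i, s_j) / (P(s_i) P(s_j)) from a W x W bigram count matrix."""
    counts = np.asarray(counts, dtype=float)
    joint = counts / counts.sum()
    p_first = joint.sum(axis=1, keepdims=True)   # marginal of the first word (assumed estimate)
    p_second = joint.sum(axis=0, keepdims=True)  # marginal of the second word (assumed estimate)
    with np.errstate(divide="ignore"):           # zero counts give -inf, hence the need for smoothing
        return np.log(joint) - np.log(p_first) - np.log(p_second)

counts = np.array([[4, 1], [1, 4]])
G = pmi_matrix(counts)
```

Bigrams with zero counts receive a PMI of negative infinity here, which is one reason smoothing is applied before regression.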

3 Link Function of Text 

In this section, we formulate the probability of a sequence of words as a
function of their embeddings. We start from the link function of bigrams,
which are the building blocks of a long sequence. Then this link function is
extended to a text window with $c$ context words, as a first-order
approximation of the actual probability.

3.1 Link Function of Bigrams

We generalize the link function of "word2vec" and "GloVe" to the following:

$$P(s_i, s_j) = \exp\{\boldsymbol{v}_{s_j}^\top \boldsymbol{v}_{s_i} + a_{s_i s_j}\} P(s_i) P(s_j). \tag{1}$$

The rationale for (1) originates from the idea of the Product of Experts in
(Hinton, 2002). Suppose different types of semantic/syntactic regularities
between $s_i$ and $s_j$ are encoded in different dimensions of
$\boldsymbol{v}_{s_i}, \boldsymbol{v}_{s_j}$. As
$\exp\{\boldsymbol{v}_{s_j}^\top \boldsymbol{v}_{s_i}\} = \prod_l \exp\{v_{s_i,l} \cdot v_{s_j,l}\}$,
this means the effects of different regularities on the probability are
combined by multiplying together. If $s_i$ and $s_j$ are independent, their
joint probability should be $P(s_i)P(s_j)$. In the presence of correlations,
the actual joint probability $P(s_i, s_j)$ would be a scaling of it. The scale
factor reflects how much $s_i$ and $s_j$ are positively or negatively
correlated. Within the scale factor,
$\boldsymbol{v}_{s_j}^\top \boldsymbol{v}_{s_i}$ captures linear interactions
between $s_i$ and $s_j$, and the residual $a_{s_i s_j}$ captures nonlinear or
noisy interactions. In applications, only
$\boldsymbol{v}_{s_j}^\top \boldsymbol{v}_{s_i}$ is of interest. Hence the
larger the magnitude of $\boldsymbol{v}_{s_j}^\top \boldsymbol{v}_{s_i}$
relative to $a_{s_i s_j}$, the better.

Note that we do not assume $a_{s_i s_j} = a_{s_j s_i}$. This provides the
flexibility $P(s_i, s_j) \ne P(s_j, s_i)$, agreeing with the asymmetry of
bigrams in natural languages. At the same time,
$\boldsymbol{v}_{s_j}^\top \boldsymbol{v}_{s_i}$ imposes a symmetric part
between $P(s_i, s_j)$ and $P(s_j, s_i)$.

(1) is equivalent to

$$P(s_j | s_i) = \exp\{\boldsymbol{v}_{s_j}^\top \boldsymbol{v}_{s_i} + a_{s_i s_j} + \log P(s_j)\}, \tag{2}$$

$$\log \frac{P(s_j | s_i)}{P(s_j)} = \boldsymbol{v}_{s_j}^\top \boldsymbol{v}_{s_i} + a_{s_i s_j}. \tag{3}$$

(3) of all bigrams is represented in matrix form:

$$\boldsymbol{V}^\top \boldsymbol{V} + \boldsymbol{A} = \boldsymbol{G}, \tag{4}$$

where $\boldsymbol{G}$ is the PMI matrix.

3.1.1 Gaussian Priors on Embeddings

When (3) is employed on the regression of empirical bigram probabilities, a
practical issue arises: more and more bigrams have zero frequency as the
constituting words become less frequent. A zero-frequency bigram does not
necessarily imply negative correlation between the two constituting words; it
could simply result from missing data. But in this case, even after smoothing,
(3) will force $\boldsymbol{v}_{s_j}^\top \boldsymbol{v}_{s_i} + a_{s_i s_j}$
to be a big negative number, making $\boldsymbol{v}_{s_i}$ overly long. The
increased magnitude of embeddings is a sign of overfitting.

To reduce overfitting of embeddings of infrequent words, we assign a Spherical
Gaussian prior $\mathcal{N}(\boldsymbol{0}, \frac{1}{2\mu_i}\boldsymbol{I})$ to
$\boldsymbol{v}_{s_i}$:

$$P(\boldsymbol{v}_{s_i}) \propto \exp\{-\mu_i \|\boldsymbol{v}_{s_i}\|^2\},$$

where the hyperparameter $\mu_i$ increases as the frequency of $s_i$ decreases.

3.1.2 Gaussian Priors on Residuals 

We wish $\boldsymbol{v}_{s_j}^\top \boldsymbol{v}_{s_i}$ in (1) to capture as
much of the correlation between $s_i$ and $s_j$ as possible. Thus the smaller
$a_{s_i s_j}$ is, the better. In addition, the more frequent $s_i, s_j$ is in
the corpus, the less noise there is in their empirical distribution, and thus
the residual $a_{s_i s_j}$ should be more heavily penalized.









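To this end, we penalize the residual $a_{s_i s_j}$ by
$f(\tilde{P}(s_i, s_j))\, a_{s_i s_j}^2$, where $f(\cdot)$ is a nonnegative
monotonic transformation, referred to as the weighting function. Let $h_{ij}$
denote $\tilde{P}(s_i, s_j)$, then the total penalty of all residuals is the
square of the weighted Frobenius norm of $\boldsymbol{A}$:

$$\sum_{s_i, s_j \in \boldsymbol{S}} f(h_{ij})\, a_{s_i s_j}^2 = \|\boldsymbol{A}\|^2_{f(\boldsymbol{H})}. \tag{5}$$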


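By referring to "GloVe", we use the following weighting function, and find it
performs well:

$$f(h_{ij}) = \begin{cases} \sqrt{h_{ij}} / \sqrt{C_{\mathrm{cut}}} & \sqrt{h_{ij}} < \sqrt{C_{\mathrm{cut}}},\ i \ne j \\ 1 & \sqrt{h_{ij}} \ge \sqrt{C_{\mathrm{cut}}},\ i \ne j \\ 0 & i = j, \end{cases}$$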

where $C_{\mathrm{cut}}$ is chosen to cut the most frequent 0.02% of the
bigrams off at 1. When $s_i = s_j$, two identical words usually have much
smaller probability to collocate. Hence $\tilde{P}(s_i, s_i)$ does not reflect
the true correlation of a word to itself, and should not put constraints on the
embeddings. We eliminate their effects by setting $f(h_{ii})$ to 0.

If the domain of $\boldsymbol{A}$ is the whole space $\mathbb{R}^{W \times W}$,
then this penalty is equivalent to a Gaussian prior
$\mathcal{N}(0, \frac{1}{2 f(h_{ij})})$ on each $a_{s_i s_j}$. The variances of
the Gaussians are determined by the bigram empirical probability matrix
$\boldsymbol{H}$.
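A minimal sketch of the weighting function follows, assuming it has the GloVe-style form $f(h_{ij}) = \min(\sqrt{h_{ij}/C_{\mathrm{cut}}}, 1)$ off the diagonal and 0 on the diagonal. The helper names `weight_matrix` and `choose_c_cut` are hypothetical, and picking $C_{\mathrm{cut}}$ as a quantile over the nonzero off-diagonal entries is one possible reading of "cut the most frequent 0.02% of the bigrams off at 1".

```python
import numpy as np

def weight_matrix(H, c_cut):
    """f(h_ij) = min(sqrt(h_ij / c_cut), 1) off the diagonal, and 0 on the diagonal."""
    H = np.asarray(H, dtype=float)
    F = np.minimum(np.sqrt(H / c_cut), 1.0)
    np.fill_diagonal(F, 0.0)   # identical-word bigrams put no constraint on the embeddings
    return F

def choose_c_cut(H, top_fraction=0.0002):
    """Assumed rule: the threshold above which the top `top_fraction` of observed bigrams lie."""
    H = np.asarray(H, dtype=float)
    off_diag = H[~np.eye(H.shape[0], dtype=bool)]
    observed = off_diag[off_diag > 0]
    return float(np.quantile(observed, 1.0 - top_fraction))

H = np.array([[0.10, 0.04], [0.01, 0.20]])
F = weight_matrix(H, c_cut=0.04)
```

Since a weight of 0 corresponds to an infinite prior variance, entries with $f(h_{ij}) = 0$ leave the corresponding residual unconstrained.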


3.1.3 Jelinek-Mercer Smoothing of Bigrams 

As another measure to reduce the impact of missing data, we apply the commonly
used Jelinek-Mercer Smoothing (Zhai and Lafferty, 2004) to smooth the empirical
conditional probability $\tilde{P}(s_j | s_i)$ by the unigram probability
$P(s_j)$ as:

$$\tilde{P}_{\mathrm{smoothed}}(s_j | s_i) = (1 - \kappa)\tilde{P}(s_j | s_i) + \kappa P(s_j). \tag{6}$$

Accordingly, the smoothed bigram empirical joint probability is defined as

$$\tilde{P}_{\mathrm{smoothed}}(s_i, s_j) = (1 - \kappa)\tilde{P}(s_i, s_j) + \kappa P(s_i) P(s_j). \tag{7}$$

In practice, we find $\kappa = 0.02$ yields good results. When
$\kappa > 0.04$, the obtained embeddings begin to degrade with $\kappa$,
indicating that smoothing distorts the true bigram distributions.
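The smoothed joint probability can be computed in one line. The sketch below assumes the smoothed joint is $(1-\kappa)$ times the empirical joint plus $\kappa$ times the product of unigram probabilities, as in (7). The function name `jm_smooth_joint` and the toy inputs are hypothetical.

```python
import numpy as np

def jm_smooth_joint(H, u, kappa=0.02):
    """Jelinek-Mercer smoothing of the joint: (1 - kappa) * H + kappa * u u^T."""
    H = np.asarray(H, dtype=float)
    u = np.asarray(u, dtype=float)
    return (1.0 - kappa) * H + kappa * np.outer(u, u)

H = np.array([[0.5, 0.0], [0.2, 0.3]])   # empirical bigram probabilities, one unseen bigram
u = np.array([0.5, 0.5])                 # unigram probabilities
H_smoothed = jm_smooth_joint(H, u)
```

The unseen bigram now has a small positive probability, so its logarithm is finite, and the smoothed matrix still sums to 1.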


3.2 Link Function of a Text Window 

In the previous subsection, a regression link function of bigram probabilities
is established. In this section, we adopt a first-order approximation based on
Information Theory, and extend the link function to a longer sequence
$w_0, \cdots, w_{c-1}, w_c$.

Decomposing a distribution conditioned on $n$ random variables into the
conditional distributions on its subsets is deeply rooted in Information
Theory. This is an intricate problem because there could be both (pointwise)
redundant information and (pointwise) synergistic information among the
conditioning variables (Williams and Beer, 2010). They are both functions of
the PMI. Based on an analysis of the complementing roles of these two types of
pointwise information, we assume they are approximately equal and cancel each
other when computing the pointwise interaction information. See Appendix B for
a detailed discussion.

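Following the above assumption, we have
$\mathrm{PMI}(w_2; w_0, w_1) \approx \mathrm{PMI}(w_2; w_0) + \mathrm{PMI}(w_2; w_1)$:

$$\log \frac{P(w_0, w_1 | w_2)}{P(w_0, w_1)} \approx \log \frac{P(w_0 | w_2)}{P(w_0)} + \log \frac{P(w_1 | w_2)}{P(w_1)}.$$

Plugging (1) and (3) into the above, we obtain

$$P(w_0, w_1, w_2) \approx \exp\Big\{ \sum_{\substack{i,j=0 \\ i<j}}^{2} \big(\boldsymbol{v}_{w_j}^\top \boldsymbol{v}_{w_i} + a_{w_i w_j}\big) + \sum_{i=0}^{2} \log P(w_i) \Big\}.$$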

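We extend the above assumption and suppose that the pointwise interaction
information is still close to 0 within a longer text window. Accordingly the
above equation extends to a context of size $c > 2$:

$$P(w_0, \cdots, w_c) \approx \exp\Big\{ \sum_{\substack{i,j=0 \\ i<j}}^{c} \big(\boldsymbol{v}_{w_j}^\top \boldsymbol{v}_{w_i} + a_{w_i w_j}\big) + \sum_{i=0}^{c} \log P(w_i) \Big\}.$$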

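From it derives the conditional distribution of $w_c$, given its context
$w_0, \cdots, w_{c-1}$:

$$P(w_c | w_0 : w_{c-1}) = \frac{P(w_0, \cdots, w_c)}{P(w_0, \cdots, w_{c-1})} \approx P(w_c) \exp\Big\{ \boldsymbol{v}_{w_c}^\top \sum_{i=0}^{c-1} \boldsymbol{v}_{w_i} + \sum_{i=0}^{c-1} a_{w_i w_c} \Big\}. \tag{8}$$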
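To make the text-window link function concrete, the sketch below evaluates, for every candidate focus word $w_c$, the quantity $P(w_c)\exp\{\boldsymbol{v}_{w_c}^\top \sum_i \boldsymbol{v}_{w_i} + \sum_i a_{w_i w_c}\}$ given a context. The function names, the array layout (embeddings as columns, `A[i, j]` holding $a_{s_i s_j}$) and the final renormalization are assumptions of this example; the approximation itself is not guaranteed to sum to 1 over the vocabulary.

```python
import numpy as np

def next_word_scores(V, A, u, context):
    """P(w_c) * exp(v_{w_c}^T sum_i v_{w_i} + sum_i a_{w_i w_c}) for every candidate w_c.

    V: N x W embedding matrix, A: W x W residuals with A[i, j] = a_{s_i s_j},
    u: unigram probabilities, context: word indices w_0, ..., w_{c-1}.
    """
    context = np.asarray(context)
    context_sum = V[:, context].sum(axis=1)   # sum of the context embeddings
    linear = V.T @ context_sum                # inner product with each candidate embedding
    residual = A[context, :].sum(axis=0)      # summed residuals from each context word
    return u * np.exp(linear + residual)

def next_word_distribution(V, A, u, context):
    """Scores renormalized to sum to 1 (the renormalization is an addition of this sketch)."""
    scores = next_word_scores(V, A, u, context)
    return scores / scores.sum()

V = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
A = np.zeros((3, 3))
A[0, 2] = 0.5
u = np.array([0.5, 0.3, 0.2])
scores = next_word_scores(V, A, u, [0])
```

Because the contributions of the context words enter as sums inside the exponent, adding a context word multiplies each candidate's score by one extra factor.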


4 Generative Process and Likelihood 


We proceed to assume the text is generated from a Markov chain of order $c$,
i.e., a word only depends on words within its context of size $c$. Given the
hyperparameter $\boldsymbol{\mu} = (\mu_1, \cdots, \mu_W)$, the generative
process of the whole corpus is:

1. For each word $s_i$, draw the embedding $\boldsymbol{v}_{s_i}$ from
   $\mathcal{N}(\boldsymbol{0}, \frac{1}{2\mu_i}\boldsymbol{I})$;

2. For each bigram $s_i, s_j$, draw the residual $a_{s_i s_j}$ from
   $\mathcal{N}(0, \frac{1}{2 f(h_{ij})})$;

3. For each document $d_i$, for the $j$-th word, draw word $w_{ij}$ from
   $\boldsymbol{S}$ with probability $P(w_{ij} | w_{i,j-c} : w_{i,j-1})$
   defined by (8).













Figure 1: The Graphical Model of PSDVec 


The above generative process for a document $d$ is presented as a graphical
model in Figure 1.

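Based on this generative process, the probability of a document $d_i$ can be
derived as follows, given the embeddings and residuals
$\boldsymbol{V}, \boldsymbol{A}$:

$$P(d_i | \boldsymbol{V}, \boldsymbol{A}) = \prod_{j=1}^{L_i} P(w_{ij}) \exp\Big\{ \boldsymbol{v}_{w_{ij}}^\top \sum_{k=j-c}^{j-1} \boldsymbol{v}_{w_{ik}} + \sum_{k=j-c}^{j-1} a_{w_{ik} w_{ij}} \Big\}.$$

The complete-data likelihood of the corpus is:

$$p(\boldsymbol{D}, \boldsymbol{V}, \boldsymbol{A}) = \prod_{i=1}^{W} \mathcal{N}\Big(\boldsymbol{0}, \frac{1}{2\mu_i}\boldsymbol{I}\Big) \prod_{i,j=1}^{W,W} \mathcal{N}\Big(0, \frac{1}{2 f(h_{ij})}\Big) \prod_{i=1}^{M} p(d_i | \boldsymbol{V}, \boldsymbol{A})$$

$$= \frac{1}{Z(\boldsymbol{H}, \boldsymbol{\mu})} \exp\Big\{ -\sum_{i,j=1}^{W,W} f(h_{ij})\, a_{s_i s_j}^2 - \sum_{i=1}^{W} \mu_i \|\boldsymbol{v}_{s_i}\|^2 \Big\} \cdot \prod_{i,j=1}^{M,L_i} P(w_{ij}) \exp\Big\{ \boldsymbol{v}_{w_{ij}}^\top \sum_{k=j-c}^{j-1} \boldsymbol{v}_{w_{ik}} + \sum_{k=j-c}^{j-1} a_{w_{ik} w_{ij}} \Big\},$$

where $Z(\boldsymbol{H}, \boldsymbol{\mu})$ is the normalizing constant.

Taking the logarithm of both sides of
$p(\boldsymbol{D}, \boldsymbol{A}, \boldsymbol{V})$ yields

$$\log p(\boldsymbol{D}, \boldsymbol{V}, \boldsymbol{A}) = C_0 - \log Z(\boldsymbol{H}, \boldsymbol{\mu}) - \|\boldsymbol{A}\|^2_{f(\boldsymbol{H})} - \sum_{i=1}^{W} \mu_i \|\boldsymbol{v}_{s_i}\|^2 + \sum_{i,j=1}^{M,L_i} \Big\{ \boldsymbol{v}_{w_{ij}}^\top \sum_{k=j-c}^{j-1} \boldsymbol{v}_{w_{ik}} + \sum_{k=j-c}^{j-1} a_{w_{ik} w_{ij}} \Big\}, \tag{9}$$

where $C_0 = \sum_{i,j=1}^{M,L_i} \log P(w_{ij})$ is constant.

5 Learning Algorithm
5.1 Learning Objective

The learning objective is to find the embeddings $\boldsymbol{V}$ that maximize
the corpus log-likelihood (9).

Let $x_{ij}$ denote the (smoothed) frequency of bigram $s_i, s_j$ in the
corpus. Then (9) is rearranged as:

$$\log p(\boldsymbol{D}, \boldsymbol{V}, \boldsymbol{A}) = C_0 - \log Z(\boldsymbol{H}, \boldsymbol{\mu}) - \|\boldsymbol{A}\|^2_{f(\boldsymbol{H})} - \sum_{i=1}^{W} \mu_i \|\boldsymbol{v}_{s_i}\|^2 + \sum_{i,j=1}^{W,W} x_{ij} \big(\boldsymbol{v}_{s_j}^\top \boldsymbol{v}_{s_i} + a_{s_i s_j}\big). \tag{10}$$

As the corpus size increases,
$\sum_{i,j=1}^{W,W} x_{ij} (\boldsymbol{v}_{s_j}^\top \boldsymbol{v}_{s_i} + a_{s_i s_j})$
will dominate the parameter prior terms. Then we can ignore the prior terms
when maximizing (10):

$$\max \sum_{i,j} x_{ij} \big(\boldsymbol{v}_{s_j}^\top \boldsymbol{v}_{s_i} + a_{s_i s_j}\big) = \Big(\sum_{i,j} x_{ij}\Big) \cdot \max \sum_{i,j} \tilde{P}_{\mathrm{smoothed}}(s_i, s_j) \log P(s_i, s_j).$$

Here the unigram terms $\log P(s_i) + \log P(s_j)$ implied by (1) are omitted,
since they do not depend on $\boldsymbol{V}$ and $\boldsymbol{A}$. As both
$\{\tilde{P}_{\mathrm{smoothed}}(s_i, s_j)\}$ and $\{P(s_i, s_j)\}$ sum to 1,
the above sum is maximized when

$$P(s_i, s_j) = \tilde{P}_{\mathrm{smoothed}}(s_i, s_j).$$

The maximum likelihood estimator is then:

$$P(s_j | s_i) = \tilde{P}_{\mathrm{smoothed}}(s_j | s_i),$$

$$\boldsymbol{v}_{s_j}^\top \boldsymbol{v}_{s_i} + a_{s_i s_j} = \log \frac{\tilde{P}_{\mathrm{smoothed}}(s_j | s_i)}{P(s_j)}. \tag{11}$$

Writing (11) in matrix form:

$$\boldsymbol{B}^* = \big(\tilde{P}_{\mathrm{smoothed}}(s_j | s_i)\big)_{s_i, s_j \in \boldsymbol{S}},$$

$$\boldsymbol{G}^* = \log \boldsymbol{B}^* - \log \boldsymbol{u} \otimes (1 \cdots 1), \tag{12}$$

where "$\otimes$" is the outer product.

Now we fix the values of
$\boldsymbol{v}_{s_j}^\top \boldsymbol{v}_{s_i} + a_{s_i s_j}$ at the above
optimal. The corpus likelihood becomes

$$\log p(\boldsymbol{D}, \boldsymbol{V}, \boldsymbol{A}) = C_1 - \|\boldsymbol{A}\|^2_{f(\boldsymbol{H})} - \sum_{i=1}^{W} \mu_i \|\boldsymbol{v}_{s_i}\|^2, \quad \text{subject to } \boldsymbol{V}^\top \boldsymbol{V} + \boldsymbol{A} = \boldsymbol{G}^*, \tag{13}$$

where $C_1 = C_0 + \sum_{i,j} x_{ij} \log \tilde{P}_{\mathrm{smoothed}}(s_i, s_j) - \log Z(\boldsymbol{H}, \boldsymbol{\mu})$
is constant.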
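As a sketch of how the regression target could be computed, the function below evaluates $g^*_{ij} = \log \tilde{P}_{\mathrm{smoothed}}(s_j | s_i) - \log P(s_j)$ entry by entry, which is the element-wise reading of the maximum likelihood estimator. The name `estimate_g_star` and the layout convention (row $i$ of `B` holds the smoothed distribution of $s_j$ given $s_i$) are assumptions of this example.

```python
import numpy as np

def estimate_g_star(B, u):
    """G*[i, j] = log B[i, j] - log u[j], with B[i, j] the smoothed P(s_j | s_i)."""
    B = np.asarray(B, dtype=float)
    u = np.asarray(u, dtype=float)
    return np.log(B) - np.log(u)[np.newaxis, :]

B = np.array([[0.5, 0.5], [0.2, 0.8]])   # toy smoothed conditional probabilities
u = np.array([0.5, 0.5])                 # toy unigram probabilities
G_star = estimate_g_star(B, u)
```

In the first row the conditional distribution equals the unigram distribution, so the corresponding targets are 0, i.e. no correlation is to be explained by the embeddings or residuals.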

5.2 Learning V as Low Rank PSD 
Approximation 

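Once $\boldsymbol{G}^*$ has been estimated from the corpus using (12), we seek
$\boldsymbol{V}$ that maximizes (13). This is to find the maximum a posteriori
(MAP) estimates of $\boldsymbol{V}, \boldsymbol{A}$ that satisfy
$\boldsymbol{V}^\top \boldsymbol{V} + \boldsymbol{A} = \boldsymbol{G}^*$.
Applying this constraint to (13), we obtain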










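$$\arg\max_{\boldsymbol{V}} \log p(\boldsymbol{D}, \boldsymbol{V}, \boldsymbol{A}) = \arg\min_{\boldsymbol{V}} \|\boldsymbol{G}^* - \boldsymbol{V}^\top \boldsymbol{V}\|^2_{f(\boldsymbol{H})} + \sum_{i=1}^{W} \mu_i \|\boldsymbol{v}_{s_i}\|^2. \tag{14}$$

Algorithm 1 BCD algorithm for finding an unregularized rank-$N$ weighted PSD approximant.
Input: matrix $\boldsymbol{G}^*$, weight matrix $\boldsymbol{W} = f(\boldsymbol{H})$, iteration number $T$, rank $N$
  Randomly initialize $\boldsymbol{X}^{(0)}$
  for $t = 1, \cdots, T$ do
    $\boldsymbol{G}_t = \boldsymbol{W} \circ \boldsymbol{G}^* + (\boldsymbol{1} - \boldsymbol{W}) \circ \boldsymbol{X}^{(t-1)}$
    $\boldsymbol{X}^{(t)}$ = PSD_Approximate($\boldsymbol{G}_t$, $N$)
  end for
  $\boldsymbol{\lambda}, \boldsymbol{Q}$ = Eigen_Decomposition($\boldsymbol{X}^{(T)}$)
  $\boldsymbol{V}^* = \mathrm{diag}(\boldsymbol{\lambda}^{1/2}[1{:}N]) \cdot \boldsymbol{Q}^\top[1{:}N]$
Output: $\boldsymbol{V}^*$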

Let $\boldsymbol{X} = \boldsymbol{V}^\top \boldsymbol{V}$. Then
$\boldsymbol{X}$ is positive semidefinite of rank $N$. Finding $\boldsymbol{V}$
that minimizes (14) is equivalent to finding a rank-$N$ weighted positive
semidefinite approximant $\boldsymbol{X}$ of $\boldsymbol{G}^*$, subject to
Tikhonov regularization. This problem does not admit an analytic solution, and
can only be solved using local optimization methods.

First we consider a simpler case where all the words in the vocabulary are
sufficiently frequent, and thus Tikhonov regularization is unnecessary. In this
case, we set $\mu_i = 0$ for all $i$, and (14) becomes an unregularized
optimization problem. We adopt the Block Coordinate Descent (BCD) algorithm* in
(Srebro et al., 2003) to approach this problem. The original algorithm is to
find a generic rank-$N$ matrix for a weighted approximation problem, and we
tailor it by constraining the matrix within the positive semidefinite manifold.

We summarize our learning algorithm in Algorithm 1. Here "$\circ$" is the
entry-wise product. We suppose the eigenvalues $\boldsymbol{\lambda}$ returned
by Eigen_Decomposition($\boldsymbol{X}$) are in descending order.
$\boldsymbol{Q}^\top[1{:}N]$ extracts rows 1 to $N$ from $\boldsymbol{Q}^\top$.


One key issue is how to initialize $\boldsymbol{X}$. Srebro et al. (2003)
suggest to set $\boldsymbol{X}^{(0)} = \boldsymbol{G}^*$, and point out that
$\boldsymbol{X}^{(0)} = \boldsymbol{0}$ is far from a local optimum, thus
requires more iterations. However, we find $\boldsymbol{G}^*$ is also far from
a local optimum, and this setting converges slowly too. Setting
$\boldsymbol{X}^{(0)} = \boldsymbol{G}^*/2$ usually yields a satisfactory
solution in a few iterations.

*It is referred to as an Expectation-Maximization algorithm by the original
authors, but we think this is a misnomer.

The subroutine PSD_Approximate() computes the unweighted nearest rank-$N$ PSD
approximation, measured in F-norm (Higham, 1988).
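The block coordinate descent procedure can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the names `psd_approximate` and `bcd_weighted_psd` are hypothetical, the weights are assumed to lie in $[0, 1]$, the nearest PSD approximation is computed from the symmetric part of its input, and the default initialization $\boldsymbol{G}^*/2$ follows the suggestion in the text.

```python
import numpy as np

def psd_approximate(G, rank):
    """Nearest (Frobenius norm) PSD matrix of rank <= `rank` to the symmetric part of G."""
    sym = (G + G.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(sym)         # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:rank]         # indices of the largest eigenvalues
    kept = np.clip(eigvals[top], 0.0, None)        # negative eigenvalues are dropped
    return (eigvecs[:, top] * kept) @ eigvecs[:, top].T

def bcd_weighted_psd(G_star, weights, rank, n_iter=50, X0=None):
    """Returns V (rank x W) such that V.T @ V approximates G_star under `weights`."""
    X = G_star / 2.0 if X0 is None else X0
    for _ in range(n_iter):
        # Entries with low weight are filled in with the current approximation.
        G_t = weights * G_star + (1.0 - weights) * X
        X = psd_approximate(G_t, rank)
    eigvals, eigvecs = np.linalg.eigh(X)
    top = np.argsort(eigvals)[::-1][:rank]
    lam = np.clip(eigvals[top], 0.0, None)
    return np.sqrt(lam)[:, np.newaxis] * eigvecs[:, top].T

rng = np.random.default_rng(0)
V_true = rng.normal(size=(2, 5))
G = V_true.T @ V_true                              # an exactly rank-2 PSD target
V = bcd_weighted_psd(G, np.ones_like(G), rank=2, n_iter=5)
```

With all weights equal to 1 the loop reduces to a single unweighted PSD approximation, so an exactly rank-2 target is recovered up to a rotation of the embeddings.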


5.3 Online Blockwise Regression of V 

In Algorithm 1, the essential subroutine PSD_Approximate() does
eigendecomposition on $\boldsymbol{G}_t$, which is dense due to the logarithm
transformation. Eigendecomposition on a $W \times W$ dense matrix requires
$O(W^2)$ space and $O(W^3)$ time, which is difficult to scale up to a large
vocabulary. In addition, the majority of words in the vocabulary are
infrequent, and Tikhonov regularization is necessary for them.

It is observed that, as words become less frequent, fewer and fewer words
appear around them to form bigrams. Recall that the vocabulary
$\boldsymbol{S} = \{s_1, \cdots, s_W\}$ is sorted in descending order of
frequency, hence the lower-right blocks of $\boldsymbol{H}$ and
$f(\boldsymbol{H})$ are very sparse, and cause these blocks in (14) to
contribute much less penalty relative to other regions. Therefore these blocks
could be ignored when doing regression, without sacrificing too much accuracy.
This intuition leads to the following online blockwise regression.

The basic idea is to select a small set (e.g. 30,000) of the most frequent
words as the core words, and partition the remaining noncore words into sets of
moderate sizes. Bigrams consisting of two core words are referred to as core
bigrams, which correspond to the top-left blocks of $\boldsymbol{G}$ and
$f(\boldsymbol{H})$. The embeddings of core words are learned approximately
using Algorithm 1, on the top-left blocks of $\boldsymbol{G}$ and
$f(\boldsymbol{H})$. Then we fix the embeddings of core words, and find the
embeddings of each set of noncore words in turn. After ignoring the lower-right
regions of $\boldsymbol{G}$ and $f(\boldsymbol{H})$ which correspond to bigrams
of two noncore words, the quadratic terms of noncore embeddings are ignored.
Consequently, finding these embeddings becomes a weighted ridge regression
problem, which can be solved efficiently in closed-form. Finally we combine all
embeddings to get the embeddings of the whole vocabulary. The details are as
follows:

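1. Partition $\boldsymbol{S}$ into $K$ consecutive groups
   $\boldsymbol{S}_1, \cdots, \boldsymbol{S}_K$. Take $K = 3$ as an example.
   The first group is core words;

2. Accordingly partition $\boldsymbol{G}$ into $K \times K$ blocks, in this
   example as

   $$\boldsymbol{G} = \begin{pmatrix} \boldsymbol{G}_{11} & \boldsymbol{G}_{12} & \boldsymbol{G}_{13} \\ \boldsymbol{G}_{21} & \boldsymbol{G}_{22} & \boldsymbol{G}_{23} \\ \boldsymbol{G}_{31} & \boldsymbol{G}_{32} & \boldsymbol{G}_{33} \end{pmatrix}.$$

   Partition $f(\boldsymbol{H}), \boldsymbol{A}$ in the same way.
   $\boldsymbol{G}_{11}, f(\boldsymbol{H})_{11}, \boldsymbol{A}_{11}$
   correspond to core bigrams. Partition $\boldsymbol{V}$ into
   $(\boldsymbol{V}_1 \mid \boldsymbol{V}_2 \mid \boldsymbol{V}_3)$, whose
   column blocks correspond to
   $\boldsymbol{S}_1, \boldsymbol{S}_2, \boldsymbol{S}_3$;

3. Solve $\boldsymbol{V}_1^\top \boldsymbol{V}_1 + \boldsymbol{A}_{11} = \boldsymbol{G}_{11}$
   using Algorithm 1, and obtain core embeddings $\boldsymbol{V}_1^*$;

4. Set $\boldsymbol{V}_1 = \boldsymbol{V}_1^*$, and find $\boldsymbol{V}_2^*$
   that minimizes the total penalty of the 12-th and 21-th blocks of residuals
   (the 22-th block is ignored due to its high sparsity):

   $$\arg\min_{\boldsymbol{V}_2} \|\boldsymbol{G}_{12} - \boldsymbol{V}_1^\top \boldsymbol{V}_2\|^2_{f(\boldsymbol{H})_{12}} + \|\boldsymbol{G}_{21} - \boldsymbol{V}_2^\top \boldsymbol{V}_1\|^2_{f(\boldsymbol{H})_{21}} + \sum_{s_i \in \boldsymbol{S}_2} \mu_i \|\boldsymbol{v}_{s_i}\|^2$$

   $$= \arg\min_{\boldsymbol{V}_2} \|\bar{\boldsymbol{G}}_{12} - \boldsymbol{V}_1^\top \boldsymbol{V}_2\|^2_{\bar{f}(\boldsymbol{H})_{12}} + \sum_{s_i \in \boldsymbol{S}_2} \mu_i \|\boldsymbol{v}_{s_i}\|^2,$$

   where $\bar{f}(\boldsymbol{H})_{12} = f(\boldsymbol{H})_{12} + f(\boldsymbol{H})_{21}^\top$,
   $\bar{\boldsymbol{G}}_{12} = \big(\boldsymbol{G}_{12} \circ f(\boldsymbol{H})_{12} + \boldsymbol{G}_{21}^\top \circ f(\boldsymbol{H})_{21}^\top\big) / \big(f(\boldsymbol{H})_{12} + f(\boldsymbol{H})_{21}^\top\big)$
   is the weighted average of $\boldsymbol{G}_{12}$ and
   $\boldsymbol{G}_{21}^\top$, and "$\circ$" and "/" are element-wise product
   and division, respectively. The columns in $\boldsymbol{V}_2$ are
   independent, thus for each $\boldsymbol{v}_{s_i}$, it is a separate weighted
   ridge regression problem, whose solution is (Holland, 1973):

   $$\boldsymbol{v}_{s_i}^* = \big(\boldsymbol{V}_1 \mathrm{diag}(\bar{\boldsymbol{f}}_i) \boldsymbol{V}_1^\top + \mu_i \boldsymbol{I}\big)^{-1} \boldsymbol{V}_1 \mathrm{diag}(\bar{\boldsymbol{f}}_i)\, \bar{\boldsymbol{g}}_i,$$

   where $\bar{\boldsymbol{f}}_i$ and $\bar{\boldsymbol{g}}_i$ are columns
   corresponding to $s_i$ in $\bar{f}(\boldsymbol{H})_{12}$ and
   $\bar{\boldsymbol{G}}_{12}$, respectively;

5. For any other set of noncore words $\boldsymbol{S}_k$, find
   $\boldsymbol{V}_k^*$ that minimizes the total penalty of the $1k$-th and
   $k1$-th blocks, ignoring all other $kj$-th and $jk$-th blocks;

6. Combine all subsets of embeddings to form $\boldsymbol{V}^*$. Here
   $\boldsymbol{V}^* = (\boldsymbol{V}_1^*, \boldsymbol{V}_2^*, \boldsymbol{V}_3^*)$.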
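The per-word regression in step 4 can be sketched as below. The code assumes the closed form $(\boldsymbol{V}_1 \mathrm{diag}(\bar{\boldsymbol{f}}_i) \boldsymbol{V}_1^\top + \mu_i \boldsymbol{I})^{-1} \boldsymbol{V}_1 \mathrm{diag}(\bar{\boldsymbol{f}}_i) \bar{\boldsymbol{g}}_i$ for the weighted ridge regression, and merges the two off-diagonal blocks by a weighted average. The names `noncore_embedding` and `merge_blocks`, and the choice to set the merged target to 0 where both weights are 0, are assumptions of this example.

```python
import numpy as np

def noncore_embedding(V1, f_bar, g_bar, mu):
    """Solve (V1 diag(f) V1^T + mu I) v = V1 diag(f) g for one noncore word.

    V1: N x W1 fixed core embeddings; f_bar, g_bar: length-W1 weights and PMI targets.
    """
    weighted = V1 * f_bar                              # V1 diag(f_bar): scales each column
    lhs = weighted @ V1.T + mu * np.eye(V1.shape[0])
    rhs = weighted @ g_bar
    return np.linalg.solve(lhs, rhs)

def merge_blocks(G12, G21, F12, F21):
    """Weighted average of G12 and G21^T, together with the combined weights."""
    F_bar = F12 + F21.T
    with np.errstate(invalid="ignore", divide="ignore"):
        G_bar = (G12 * F12 + G21.T * F21.T) / F_bar
    return np.where(F_bar > 0, G_bar, 0.0), F_bar      # 0 where no bigram was observed

rng = np.random.default_rng(0)
V1 = rng.normal(size=(3, 6))
v_true = rng.normal(size=3)
g = V1.T @ v_true                                      # noiseless targets
v_hat = noncore_embedding(V1, np.ones(6), g, mu=0.0)
```

Each noncore word only requires solving an $N \times N$ linear system, which is why this step scales to large vocabularies where a full eigendecomposition does not.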


6 Experimental Results 

We trained our model along with a few state-of- 
the-art competitors on Wikipedia, and evaluated 
the embeddings on 7 common benchmark sets. 

6.1 Experimental Setup 

Our own method is referred to as PSD. The competitors include:

• (Mikolov et al., 2013b): word2vec^2, or SGNS in some literature;
• (Levy and Goldberg, 2014): the PPMI matrix without dimension reduction, and
  SVD of PPMI matrix, both yielded by hyperwords;
• (Pennington et al., 2014): GloVe^3;
• (Stratos et al., 2015): Singular^4, which does SVD-based CCA on the weighted
  bigram frequency matrix;
• (Faruqui et al., 2015): Sparse^5, which learns new sparse embeddings in a
  higher dimensional space from pretrained embeddings.

^2 https://code.google.com/p/word2vec/


All models were trained on the English Wikipedia snapshot in March 2015. After
removing non-textual elements and non-English words, 2.04 billion words were
left. We used the default hyperparameters in Hyperwords when training PPMI and
SVD. Word2vec, GloVe and Singular were trained with their own default
hyperparameters.

The embedding sets PSD-Reg-180K and PSD-Unreg-180K were trained using our
online blockwise regression. Both sets contain the embeddings of the most
frequent 180,000 words, based on 25,000 core words. PSD-Unreg-180K was trained
with all $\mu_i = 0$, i.e. disabling Tikhonov regularization. PSD-Reg-180K was
trained with

$$\mu_i = \begin{cases} 2 & i \in [25001, 80000] \\ 4 & i \in [80001, 130000] \\ 8 & i \in [130001, 180000], \end{cases}$$

i.e. increased regularization as the sparsity increases. To contrast with the
batch learning performance, the performance of PSD-25K is listed, which
contains the core embeddings only. PSD-25K has the advantage that it contains
far fewer false candidate words, and some test tuples (generally harder ones)
were not evaluated due to missing words, thus its scores are not comparable to
others.

Sparse was trained with PSD-Reg-180K as the input embeddings, with default
hyperparameters.

The benchmark sets are almost identical to those in (Levy et al., 2015), except
that (Luong et al., 2013)'s Rare Words is not included, as many rare words are
cut off at the frequency 100, making more than 1/3 of test pairs invalid.

^3 http://nlp.stanford.edu/projects/glove/
^4 https://github.com/karlstratos/singular
^5 https://github.com/mfaruqui/sparse-coding

Table 2: Performance of each method across different tasks.

                  Similarity Tasks                         Analogy Tasks
Method            WS Sim  WS Rel  MEN    Turk   SimLex     Google         MSR
----------------  ------  ------  -----  -----  ------     -------------  -------------
word2vec          0.742   0.543   0.731  0.663  0.395      0.734 / 0.742  0.650 / 0.674
PPMI              0.735   0.678   0.717  0.659  0.308      0.476 / 0.524  0.183 / 0.217
SVD               0.687   0.608   0.711  0.524  0.270      0.230 / 0.240  0.123 / 0.113
GloVe             0.759   0.630   0.756  0.641  0.362      0.535 / 0.544  0.408 / 0.435
Singular          0.763   0.684   0.747  0.581  0.345      0.440 / 0.508  0.364 / 0.399
Sparse            0.739   0.585   0.725  0.625  0.355      0.240 / 0.282  0.253 / 0.274
PSD-Reg-180K      0.792   0.679   0.764  0.676  0.398      0.602 / 0.623  0.465 / 0.507
PSD-Unreg-180K    0.786   0.663   0.753  0.675  0.372      0.566 / 0.598  0.424 / 0.468
PSD-25K           0.801   0.676   0.765  0.678  0.393      0.671 / 0.695  0.533 / 0.586

Word Similarity There are 5 datasets: WordSim Similarity (WS Sim) and WordSim
Relatedness (WS Rel) (Zesch et al., 2008; Agirre et al., 2009), partitioned
from WordSim353 (Finkelstein et al., 2002); Bruni et al. (2012)'s MEN dataset;
Radinsky et al. (2011)'s Mechanical Turk dataset; and (Hill et al., 2014)'s
SimLex-999 dataset. The embeddings were evaluated by the Spearman's rank
correlation with the human ratings.
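A sketch of this evaluation protocol is given below. Scoring a word pair by the cosine similarity of its two embeddings is an assumption of the example (the usual convention, though not stated here), Spearman's correlation is computed as the Pearson correlation of tie-averaged ranks, and all function names and toy vectors are hypothetical.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1, with ties given the mean of the ranks they span."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    sorted_x = x[order]
    ranks = np.empty(len(x))
    start = 0
    while start < len(x):
        end = start
        while end + 1 < len(x) and sorted_x[end + 1] == sorted_x[start]:
            end += 1
        ranks[order[start:end + 1]] = (start + end) / 2.0 + 1.0
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def evaluate_similarity(embeddings, pairs, human_scores):
    """embeddings: dict word -> vector; pairs: list of (word1, word2) tuples."""
    model_scores = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]
    return spearman(model_scores, human_scores)

toy = {"a": np.array([1.0, 0.0]), "b": np.array([1.0, 0.1]), "c": np.array([0.0, 1.0])}
rho = evaluate_similarity(toy, [("a", "b"), ("a", "c"), ("b", "c")], [9.0, 1.0, 3.0])
```

Only the ordering of the scores matters for this metric, so embeddings are not penalized for the scale of their similarities.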

Word Analogy The two datasets are MSR's analogy dataset (Mikolov et al.,
2013c), containing 8000 questions, and Google's analogy dataset (Mikolov et
al., 2013a), with 19544 questions. After filtering questions involving
out-of-vocabulary words, i.e. words that appear fewer than 100 times in the
corpus, 7054 instances in MSR and 19364 instances in Google were left. The
analogy questions were answered using 3CosAdd as well as 3CosMul proposed by
Levy et al. (2014).
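The two answering rules can be sketched as follows for a question "a is to a* as b is to ?". The code follows the commonly used formulations: 3CosAdd maximizes $\cos(x, a^*) - \cos(x, a) + \cos(x, b)$, and 3CosMul maximizes $\cos(x, a^*)\cos(x, b) / (\cos(x, a) + \epsilon)$ with cosines shifted to $[0, 1]$. The function name, the value of `eps` and the toy vocabulary are assumptions of this example.

```python
import numpy as np

def analogy(E, a, a_star, b, method="add", eps=1e-3):
    """Answer a : a_star :: b : ? over a row-normalized W x N matrix E; returns a word index."""
    cos_a, cos_a_star, cos_b = E @ E[a], E @ E[a_star], E @ E[b]
    if method == "add":
        scores = cos_a_star - cos_a + cos_b                   # 3CosAdd
    else:
        # 3CosMul: shift cosines from [-1, 1] to [0, 1] so that all factors are nonnegative
        pa, pas, pb = (cos_a + 1) / 2, (cos_a_star + 1) / 2, (cos_b + 1) / 2
        scores = pas * pb / (pa + eps)
    scores[[a, a_star, b]] = -np.inf                          # exclude the question words
    return int(np.argmax(scores))

# Toy vocabulary: 0 = man, 1 = woman, 2 = king, 3 = queen, 4 = distractor
E = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 0, -1]], dtype=float)
E /= np.linalg.norm(E, axis=1, keepdims=True)
answer_add = analogy(E, 0, 1, 2, method="add")
answer_mul = analogy(E, 0, 1, 2, method="mul")
```

Excluding the three question words from the candidates is standard practice, since one of them is otherwise often the nearest neighbor.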


6.2 Results 

Table 2 shows the results on all tasks. Word2vec significantly outperformed
other methods on analogy tasks. PPMI and SVD performed much worse on analogy
tasks than reported in (Levy et al., 2015), probably due to sub-optimal
hyperparameters. This suggests their performance is unstable. The new
embeddings yielded by Sparse systematically degraded compared to the old
embeddings, contradicting the claim in (Faruqui et al., 2015).

Our method PSD-Reg-180K performed well consistently, and is best on 4 of the 5
similarity tasks. It performed worse than word2vec on analogy tasks, but still
better than other MF-based methods. By comparing to PSD-Unreg-180K, we see
Tikhonov regularization brings a 1-4% performance boost across tasks. In
addition, on similarity tasks, online blockwise regression only degrades
slightly compared to batch factorization. Their performance gaps on analogy
tasks were wider, but this might be explained by the fact that some hard cases
were not counted in PSD-25K's evaluation, due to its limited vocabulary.

7 Conclusions and Future Work 

In this paper, inspired by the link functions in previous works, with the
support from Information Theory, we propose a new link function of a text
window, parameterized by the embeddings of words and the residuals of bigrams.
Based on the link function, we establish a generative model of documents. The
learning objective is to find a set of embeddings maximizing their posterior
likelihood given the corpus. This objective is reduced to weighted low-rank
positive-semidefinite approximation, subject to Tikhonov regularization. Then
we adopt a Block Coordinate Descent algorithm, jointly with an online blockwise
regression algorithm, to find an approximate solution. On seven benchmark sets,
the learned embeddings show competitive and stable performance.

In future work, we will incorporate global latent factors into this generative
model, such as topics, sentiments, or writing styles, and develop more
elaborate models of documents. Through learning such latent factors, important
summary information of documents would be acquired, which is useful in various
applications.

Acknowledgments 

We thank Omer Levy, Thomas Mach, Peilin Zhao, Mingkui Tan, Zhiqiang Xu and
Chunlin Wu for their helpful discussions and insights. This research is
supported by the National Research Foundation, Prime Minister's Office,
Singapore under its IDM Futures Funding Initiative and administered by the
Interactive and Digital Media Programme Office.



































Appendix A Possible Trap in SVD 


Suppose $\boldsymbol{M}$ is the bigram matrix of interest. SVD embeddings are
derived from the low rank approximation of $\boldsymbol{M}$, by keeping the
largest singular values/vectors. When some of these singular values correspond
to negative eigenvalues, undesirable correlations might be captured. The
following is an example of approximating a PMI matrix.

A vocabulary consists of 3 words $s_1, s_2, s_3$. Two corpora derive two PMI
matrices:

$$\boldsymbol{G}^{(1)} = \begin{pmatrix} 1.4 & 0.8 & 0 \\ 0.8 & 2.6 & 0 \\ 0 & 0 & 2 \end{pmatrix}, \quad \boldsymbol{G}^{(2)} = \begin{pmatrix} 0.2 & -1.6 & 0 \\ -1.6 & -2.2 & 0 \\ 0 & 0 & 2 \end{pmatrix}.$$

They have identical left singular matrix and singular values (3, 2, 1), but
their eigenvalues are (3, 2, 1) and (-3, 2, 1), respectively.

In a rank-2 approximation, the largest two singular values/vectors are kept,
and $\boldsymbol{G}^{(1)}$ and $\boldsymbol{G}^{(2)}$ yield identical SVD
embeddings
$\boldsymbol{V} = \begin{pmatrix} 0.45 & 0.89 & 0 \\ 0 & 0 & 1 \end{pmatrix}$
(the rows may be scaled depending on the algorithm, without affecting the
validity of the following conclusion). The embeddings of $s_1$ and $s_2$
(columns 1 and 2 of $\boldsymbol{V}$) point in the same direction, suggesting
they are positively correlated. However, as
$\boldsymbol{G}^{(2)}_{12} = \boldsymbol{G}^{(2)}_{21} = -1.6 < 0$, they are
actually negatively correlated in the second corpus. This inconsistency is
because the principal eigenvalue of $\boldsymbol{G}^{(2)}$ is negative, and yet
the corresponding singular value/vector is kept.

When using eigendecomposition, the largest two positive
eigenvalues/eigenvectors are kept. $\boldsymbol{G}^{(1)}$ yields the same
embeddings $\boldsymbol{V}$. $\boldsymbol{G}^{(2)}$ yields
$\boldsymbol{V}^{(2)} = \begin{pmatrix} -0.89 & 0.45 & 0 \\ 0 & 0 & 1.41 \end{pmatrix}$,
which correctly preserves the negative correlation between $s_1, s_2$.
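The trap can be checked numerically. The two matrices below are an assumed reconstruction of the toy example, chosen to be consistent with the stated singular values (3, 2, 1), eigenvalues (3, 2, 1) and (-3, 2, 1), and the entry -1.6. Scaling each kept vector by the square root of its singular value or eigenvalue is also an assumption; as noted in the text, other scalings do not change the conclusion.

```python
import numpy as np

# Assumed reconstruction of the two PMI matrices in the toy example
G1 = np.array([[1.4, 0.8, 0.0], [0.8, 2.6, 0.0], [0.0, 0.0, 2.0]])
G2 = np.array([[0.2, -1.6, 0.0], [-1.6, -2.2, 0.0], [0.0, 0.0, 2.0]])

def svd_embeddings(G, rank):
    """Rows are the top left singular vectors, scaled by the square root of the singular value."""
    U, s, _ = np.linalg.svd(G)
    return np.sqrt(s[:rank])[:, np.newaxis] * U[:, :rank].T

def eig_embeddings(G, rank):
    """Rows are eigenvectors of the largest eigenvalues (clipped at 0), scaled by their square root."""
    eigvals, eigvecs = np.linalg.eigh(G)
    top = np.argsort(eigvals)[::-1][:rank]
    return np.sqrt(np.clip(eigvals[top], 0.0, None))[:, np.newaxis] * eigvecs[:, top].T

for name, G in (("G1", G1), ("G2", G2)):
    V_svd, V_eig = svd_embeddings(G, 2), eig_embeddings(G, 2)
    # Inner product between the embeddings of s_1 and s_2 under each method
    print(name, (V_svd.T @ V_svd)[0, 1], (V_eig.T @ V_eig)[0, 1])
```

For the second matrix, the SVD embeddings give a positive inner product between $s_1$ and $s_2$, while the eigendecomposition embeddings give a negative one, in agreement with the sign of the PMI entry.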


Appendix B Information Theory 

Redundant information refers to the reduced uncertainty by knowing the value of
any one of the conditioning variables (hence redundant). Synergistic
information is the reduced uncertainty ascribed to knowing all the values of
conditioning variables, that cannot be reduced by knowing the value of any
variable alone (hence synergistic).

The mutual information $I(y; x_i)$ and the redundant information
$\mathrm{Rdn}(y; x_1, x_2)$ are defined as:

$$I(y; x_i) = E_{P(x_i, y)}\Big[\log \frac{P(y | x_i)}{P(y)}\Big],$$

$$\mathrm{Rdn}(y; x_1, x_2) = E_{P(y)}\Big[\min_{x_1, x_2} E_{P(x_i | y)}\Big[\log \frac{P(y | x_i)}{P(y)}\Big]\Big].$$

The synergistic information $\mathrm{Syn}(y; x_1, x_2)$ is defined as the
PI-function in (Williams and Beer, 2010), skipped here.



Figure 2: Different types of information among 3 random variables
$y, x_1, x_2$. $I(y; x_1, x_2)$ is the mutual information between $y$ and
$(x_1, x_2)$. $\mathrm{Rdn}(y; x_1, x_2)$ and $\mathrm{Syn}(y; x_1, x_2)$ are
the redundant information and synergistic information between $x_1, x_2$,
conditioning $y$, respectively.


The interaction information $\mathrm{Int}(x_1, x_2, y)$ measures the relative
strength of $\mathrm{Rdn}(y; x_1, x_2)$ and $\mathrm{Syn}(y; x_1, x_2)$ (Timme
et al., 2014):

$$\mathrm{Int}(x_1, x_2, y) = \mathrm{Syn}(y; x_1, x_2) - \mathrm{Rdn}(y; x_1, x_2)$$

$$= I(y; x_1, x_2) - I(y; x_1) - I(y; x_2)$$

$$= E_{P(x_1, x_2, y)}\Big[\log \frac{P(x_1) P(x_2) P(y) P(x_1, x_2, y)}{P(x_1, x_2) P(x_1, y) P(x_2, y)}\Big].$$


Figure 2 shows the relationship of different information among 3 random
variables $y, x_1, x_2$ (based on Fig. 1 in (Williams and Beer, 2010)).

PMI is the pointwise counterpart of mutual information $I$. Similarly, all the
above concepts have their pointwise counterparts, obtained by dropping the
expectation operator. Specifically, the pointwise interaction information is
defined as

$$\mathrm{PInt}(x_1, x_2, y) = \mathrm{PMI}(y; x_1, x_2) - \mathrm{PMI}(y; x_1) - \mathrm{PMI}(y; x_2) = \log \frac{P(x_1) P(x_2) P(y) P(x_1, x_2, y)}{P(x_1, x_2) P(x_1, y) P(x_2, y)}.$$

If we know $\mathrm{PInt}(x_1, x_2, y)$, we can recover
$\mathrm{PMI}(y; x_1, x_2)$ from the mutual information over the variable
subsets, and then recover the joint distribution $P(x_1, x_2, y)$.

As the pointwise redundant information $\mathrm{PRdn}(y; x_1, x_2)$ and the
pointwise synergistic information $\mathrm{PSyn}(y; x_1, x_2)$ are both
higher-order interaction terms, their magnitudes are usually much smaller than
the PMI terms. We assume they are approximately equal, and thus cancel each
other when computing PInt. Under this assumption, PInt is approximately 0. In
the case of three words $w_0, w_1, w_2$, setting
$\mathrm{PInt}(w_0, w_1, w_2) = 0$ leads to
$\mathrm{PMI}(w_2; w_0, w_1) = \mathrm{PMI}(w_2; w_0) + \mathrm{PMI}(w_2; w_1)$.
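The pointwise interaction information can be evaluated directly from a joint probability table. The sketch below assumes the closed form $\log \frac{P(x_1)P(x_2)P(y)P(x_1,x_2,y)}{P(x_1,x_2)P(x_1,y)P(x_2,y)}$ and a table with strictly positive entries. The function name and the toy distribution are hypothetical.

```python
import numpy as np

def pointwise_interaction(P, x1, x2, y):
    """PInt(x1, x2, y) for a joint table P[x1, x2, y] with strictly positive entries."""
    p1, p2, py = P.sum(axis=(1, 2)), P.sum(axis=(0, 2)), P.sum(axis=(0, 1))
    p12, p1y, p2y = P.sum(axis=2), P.sum(axis=1), P.sum(axis=0)
    num = p1[x1] * p2[x2] * py[y] * P[x1, x2, y]
    den = p12[x1, x2] * p1y[x1, y] * p2y[x2, y]
    return float(np.log(num / den))

# Three mutually independent binary variables: every pointwise interaction vanishes
P_ind = np.einsum("i,j,k->ijk", [0.3, 0.7], [0.4, 0.6], [0.5, 0.5])
pint = pointwise_interaction(P_ind, 0, 1, 1)
```

Mutual independence is only the simplest case in which the value is 0. The assumption used in the paper is weaker: redundancy and synergy need not vanish, only approximately cancel.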